Data Cleaning and Manipulation

Importing data from SurveyMonkey

Our current survey provider is SurveyMonkey. blackstone contains several functions that make reading SurveyMonkey data into R more manageable and create a codebook for the data along the way.

SurveyMonkey exports data with two header rows, which does not work directly in R, where tibbles and data frames can only have one row of column names.
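To make the problem concrete, here is a minimal base-R sketch (with hypothetical column names, not the real export) of combining two header rows into single column names, which is roughly what blackstone automates:

```r
# Hypothetical two-row-header export, as raw CSV lines:
raw_lines <- c(
  "Respondent ID,Start Date,How confident are you?",
  ",,Response",
  "1001,2024-06-05,Very confident"
)
header_1 <- strsplit(raw_lines[1], ",")[[1]]
header_2 <- strsplit(raw_lines[2], ",")[[1]]
# Paste the two header rows together and trim leftover whitespace:
combined <- trimws(paste(header_1, header_2))
combined
# "Respondent ID" "Start Date" "How confident are you? Response"
```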

Here is how to import data from SurveyMonkey using example data provided with blackstone: a fake dataset from a pre (baseline) survey. There are three steps to this process:

  1. Create a codebook.
  2. Edit and save the codebook to create meaningful variable names.
  3. Read in the data and rename the variables with the codebook.

Pre Survey Data

1. Creating the codebook

# File path for pre example data:
pre_data_fp <- blackstone::blackstoneExample("sm_data_pre.csv")
# 1. Create the codebook:
codebook_pre <- blackstone::createCodebook(pre_data_fp)
codebook_pre
## # A tibble: 32 × 5
##   header_1      header_2 combined_header position variable_name
##   <chr>         <chr>    <chr>              <int> <chr>        
## 1 Respondent ID <NA>     Respondent ID          1 respondent_id
## 2 Collector ID  <NA>     Collector ID           2 collector_id 
## 3 Start Date    <NA>     Start Date             3 start_date   
## 4 End Date      <NA>     End Date               4 end_date     
## 5 IP Address    <NA>     IP Address             5 ip_address   
## # ℹ 27 more rows

In this codebook, the first column, header_1, is the first header row from the SurveyMonkey data; header_2 is the second header row; combined_header is the combination of the two; position is the column number of each combined_header; and variable_name is a cleaned-up version of combined_header. variable_name is the column to edit later on to give the columns shorter, more meaningful names.

variable_name will be the column that renames all the variables in the SurveyMonkey data.
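As a rough sketch of what this renaming amounts to (an assumption about the mechanics for illustration, not readRenameData()'s actual implementation), each current column name is looked up in the codebook and swapped for its variable_name:

```r
# Hypothetical miniature data and codebook (not the package's internals):
dat <- data.frame(1:2, c("2024-06-05", "2024-06-21"))
names(dat) <- c("Respondent ID", "Start Date")
codebook <- data.frame(
  combined_header = c("Respondent ID", "Start Date"),
  variable_name   = c("respondent_id", "start_date")
)
# Look up each current name in the codebook and swap in variable_name:
names(dat) <- codebook$variable_name[match(names(dat), codebook$combined_header)]
names(dat)
# "respondent_id" "start_date"
```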

2. Editing and Saving the Codebook

# Step 2. Edit the codebook: 
# Set up sequential naming conventions for matrix-style questions with shared likert scale response options:
# 8 matrix-style likert items make up a scale called `research`- here is how to name them all at once:
# Rows 11 to 18 belong to the "research" matrix question (look at the codebook and match header_1 and header_2 to variable_name to find which rows to change)
research_items <- codebook_pre[["variable_name"]][11:18]
research_names <- paste0("research_", seq_along(research_items)) %>% purrr::set_names(., research_items) # Create a new named vector of names for these columns
# 6 matrix-style likert items make up a scale called `ability`- rows 19 to 24 of `variable_name`:
ability_items <- codebook_pre[["variable_name"]][19:24]
ability_names <- paste0("ability_", seq_along(ability_items)) %>% purrr::set_names(., ability_items) # Create a new named vector of names for these columns
# 5 matrix-style likert items make up a scale called `ethics`- rows 25 to 29 of `variable_name`:
ethics_items <- codebook_pre[["variable_name"]][25:29]
ethics_names <- paste0("ethics_", seq_along(ethics_items)) %>% purrr::set_names(., ethics_items) # Create a new named vector of names for these columns
# Edit the `variable_names` column: Use dplyr::mutate() and dplyr::case_match() to change the column `variable_name`:
codebook_pre <- codebook_pre %>% dplyr::mutate(
    variable_name = dplyr::case_match(
        variable_name, # column to match
        'custom_data_1' ~ "unique_id", # changes 'custom_data_1' to "unique_id"
        'to_what_extent_are_you_knowledgeable_in_conducting_research_in_your_field_of_study' ~ "knowledge",
        'with_which_gender_do_you_most_closely_identify' ~ "gender",
        'which_race_ethnicity_best_describes_you_please_choose_only_one' ~ "ethnicity",
        'are_you_a_first_generation_college_student' ~ "first_gen",
        names(research_names) ~ research_names[variable_name], # takes the above named vector and when the name matches, applies new value in that position as replacement.
        names(ability_names) ~ ability_names[variable_name],   # Same for `ability_names`
        names(ethics_names) ~ ethics_names[variable_name],   # Same for `ethics_names`
        .default = variable_name # returns default value from original `variable_name` if not changed.
        )
    )
codebook_pre
## # A tibble: 32 × 5
##   header_1      header_2 combined_header position variable_name
##   <chr>         <chr>    <chr>              <int> <chr>        
## 1 Respondent ID <NA>     Respondent ID          1 respondent_id
## 2 Collector ID  <NA>     Collector ID           2 collector_id 
## 3 Start Date    <NA>     Start Date             3 start_date   
## 4 End Date      <NA>     End Date               4 end_date     
## 5 IP Address    <NA>     IP Address             5 ip_address   
## # ℹ 27 more rows

# Write out the edited codebook to save for future use-
# Be sure to double check questions match new names before writing out:
# readr::write_csv(codebook_pre, file = "{filepath-to-codebook}")

3. Import the Data and Rename the Variables with the Codebook

# 3. Read in the data and rename the vars using readRenameData(), passing the file path and the edited codebook:
pre_data <- blackstone::readRenameData(pre_data_fp, codebook = codebook_pre)
pre_data
## # A tibble: 100 × 32
##   respondent_id collector_id start_date end_date   ip_address      email_address
##           <dbl>        <dbl> <date>     <date>     <chr>           <chr>        
## 1  114628000001    431822954 2024-06-05 2024-06-06 227.224.138.113 coraima59@me…
## 2  114628000002    431822954 2024-06-21 2024-06-22 110.241.132.50  mstamm@hermi…
## 3  114628000003    431822954 2024-06-14 2024-06-15 165.58.112.64   precious.fei…
## 4  114628000004    431822954 2024-06-15 2024-06-16 49.34.121.147   ines52@gmail…
## 5  114628000005    431822954 2024-06-15 2024-06-16 115.233.66.80   franz44@hotm…
## # ℹ 95 more rows
## # ℹ 26 more variables: first_name <chr>, last_name <chr>, unique_id <dbl>,
## #   knowledge <chr>, research_1 <chr>, research_2 <chr>, research_3 <chr>,
## #   research_4 <chr>, research_5 <chr>, research_6 <chr>, research_7 <chr>,
## #   research_8 <chr>, ability_1 <chr>, ability_2 <chr>, ability_3 <chr>,
## #   ability_4 <chr>, ability_5 <chr>, ability_6 <chr>, ethics_1 <chr>,
## #   ethics_2 <chr>, ethics_3 <chr>, ethics_4 <chr>, ethics_5 <chr>, …

The SurveyMonkey example data is now imported with names taken from the codebook column variable_name:

names(pre_data)
##  [1] "respondent_id" "collector_id"  "start_date"    "end_date"     
##  [5] "ip_address"    "email_address" "first_name"    "last_name"    
##  [9] "unique_id"     "knowledge"     "research_1"    "research_2"   
## [13] "research_3"    "research_4"    "research_5"    "research_6"   
## [17] "research_7"    "research_8"    "ability_1"     "ability_2"    
## [21] "ability_3"     "ability_4"     "ability_5"     "ability_6"    
## [25] "ethics_1"      "ethics_2"      "ethics_3"      "ethics_4"     
## [29] "ethics_5"      "gender"        "ethnicity"     "first_gen"

Post Survey Data

  • Repeat the same process with the post data; if the variables are all the same, you can reuse the same codebook.

  • For this example there are additional post variables, so a new codebook will need to be created to rename the variables when reading in the data with readRenameData().

# File path for post example data:
post_data_fp <- blackstone::blackstoneExample("sm_data_post.csv")
# 1. Create the codebook using the filepath:
codebook_post <- blackstone::createCodebook(post_data_fp)
codebook_post
## # A tibble: 37 × 5
##   header_1      header_2 combined_header position variable_name
##   <chr>         <chr>    <chr>              <int> <chr>        
## 1 Respondent ID <NA>     Respondent ID          1 respondent_id
## 2 Collector ID  <NA>     Collector ID           2 collector_id 
## 3 Start Date    <NA>     Start Date             3 start_date   
## 4 End Date      <NA>     End Date               4 end_date     
## 5 IP Address    <NA>     IP Address             5 ip_address   
## # ℹ 32 more rows

# Step 2. Edit the codebook: 
# Set up sequential naming conventions for matrix-style questions with shared likert scale response options:
# 8 matrix-style likert items make up a scale called `research`- here is how to name them all at once:
# Rows 11 to 18 belong to the "research" matrix question (look at the codebook and match header_1 and header_2 to variable_name to find which rows to change)
research_items <- codebook_post[["variable_name"]][11:18]
research_names <- paste0("research_", seq_along(research_items)) %>% purrr::set_names(., research_items) # Create a new named vector of names for these columns
# 6 matrix-style likert items make up a scale called `ability`- rows 19 to 24 of `variable_name`:
ability_items <- codebook_post[["variable_name"]][19:24]
ability_names <- paste0("ability_", seq_along(ability_items)) %>% purrr::set_names(., ability_items) # Create a new named vector of names for these columns
# 5 matrix-style likert items make up a scale called `ethics`- rows 25 to 29 of `variable_name`:
ethics_items <- codebook_post[["variable_name"]][25:29]
ethics_names <- paste0("ethics_", seq_along(ethics_items)) %>% purrr::set_names(., ethics_items) # Create a new named vector of names for these columns
# 5 open-ended follow-up items asked when the corresponding ethics items were answered "Strongly disagree" or "Disagree"- rows 30 to 34 of `variable_name`:
ethics_items_oe <- codebook_post[["variable_name"]][30:34]
ethics_names_oe <- paste0("ethics_", seq_along(ethics_items_oe), "_oe") %>% purrr::set_names(., ethics_items_oe) # Create a new named vector of names for these columns
# Edit the `variable_names` column: Use dplyr::mutate() and dplyr::case_match() to change the column `variable_name`:
codebook_post <- codebook_post %>% dplyr::mutate(
    variable_name = dplyr::case_match(
        variable_name, # column to match
        'custom_data_1' ~ "unique_id", # changes 'custom_data_1' to "unique_id"
        'to_what_extent_are_you_knowledgeable_in_conducting_research_in_your_field_of_study' ~ "knowledge",
        'with_which_gender_do_you_most_closely_identify' ~ "gender",
        'which_race_ethnicity_best_describes_you_please_choose_only_one' ~ "ethnicity",
        'are_you_a_first_generation_college_student' ~ "first_gen",
        names(research_names) ~ research_names[variable_name], # takes the above named vector and when the name matches, applies new value in that position as replacement.
        names(ability_names) ~ ability_names[variable_name],   # Same for `ability_names`
        names(ethics_names) ~ ethics_names[variable_name],   # Same for `ethics_names`
        names(ethics_names_oe) ~ ethics_names_oe[variable_name],   # Same for `ethics_names_oe`
        .default = variable_name # returns default value from original `variable_name` if not changed.
        )
    )
codebook_post
## # A tibble: 37 × 5
##   header_1      header_2 combined_header position variable_name
##   <chr>         <chr>    <chr>              <int> <chr>        
## 1 Respondent ID <NA>     Respondent ID          1 respondent_id
## 2 Collector ID  <NA>     Collector ID           2 collector_id 
## 3 Start Date    <NA>     Start Date             3 start_date   
## 4 End Date      <NA>     End Date               4 end_date     
## 5 IP Address    <NA>     IP Address             5 ip_address   
## # ℹ 32 more rows
# Write out the edited codebook to save for future use-
# Be sure to double check questions match new names before writing out:
# readr::write_csv(codebook_post, file = "{filepath-to-codebook}")
# 3. Read in the data and rename the vars using readRenameData(), passing the file path and the edited codebook:
post_data <- blackstone::readRenameData(post_data_fp, codebook = codebook_post)
post_data
## # A tibble: 100 × 37
##   respondent_id collector_id start_date end_date   ip_address      email_address
##           <dbl>        <dbl> <date>     <date>     <chr>           <chr>        
## 1  114628000001    431822954 2024-06-05 2024-06-06 227.224.138.113 coraima59@me…
## 2  114628000002    431822954 2024-06-21 2024-06-22 110.241.132.50  mstamm@hermi…
## 3  114628000003    431822954 2024-06-14 2024-06-15 165.58.112.64   precious.fei…
## 4  114628000004    431822954 2024-06-15 2024-06-16 49.34.121.147   ines52@gmail…
## 5  114628000005    431822954 2024-06-15 2024-06-16 115.233.66.80   franz44@hotm…
## # ℹ 95 more rows
## # ℹ 31 more variables: first_name <chr>, last_name <chr>, unique_id <dbl>,
## #   knowledge <chr>, research_1 <chr>, research_2 <chr>, research_3 <chr>,
## #   research_4 <chr>, research_5 <chr>, research_6 <chr>, research_7 <chr>,
## #   research_8 <chr>, ability_1 <chr>, ability_2 <chr>, ability_3 <chr>,
## #   ability_4 <chr>, ability_5 <chr>, ability_6 <chr>, ethics_1 <chr>,
## #   ethics_2 <chr>, ethics_3 <chr>, ethics_4 <chr>, ethics_5 <chr>, …

Finally, it is important to add pre_ and post_ prefixes to all unique variables before merging the datasets (i.e. the survey items that differ pre-post; the SurveyMonkey metadata and demographics are identical):

# Pre data:
pre_data <- pre_data %>% dplyr::rename_with(~ paste0("pre_", .), .cols = c(knowledge:ethics_5))
# Post data:
post_data <- post_data %>% dplyr::rename_with(~ paste0("post_", .), .cols = c(knowledge:ethics_5_oe))

Merging Data

Merge pre-post data by joining on all the variables that are shared in common.

The {dplyr} package has many joining functions; the most commonly used is dplyr::left_join(), which keeps all the observations from the first table and merges in all matching observations from the second.

For most data analysis, we will want to use the post data as the primary table and merge in the pre data, since most post surveys drop some participants; this way we can run our analysis on complete data.
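As a toy illustration of that behavior, with made-up IDs rather than the survey data, left_join() keeps every row of the first (post) table, fills NA where a pre row is missing, and drops pre-only rows:

```r
library(dplyr)

# Hypothetical miniature pre/post tables:
post <- data.frame(unique_id = c(1, 2, 3), post_score = c(5, 4, 3))
pre  <- data.frame(unique_id = c(1, 2, 4), pre_score  = c(2, 3, 5))

joined <- post %>% left_join(pre, by = "unique_id")
joined
# All three post rows are kept; unique_id 3 gets NA for pre_score,
# and the pre-only respondent (unique_id 4) is dropped.
```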

# left_join() will automatically join by all the shared columns; be sure to pass all shared variables that should be identical pre-post to the 'by = join_by()' argument
# (otherwise you will get a message listing the additional variables being joined by):
sm_data <- post_data %>% dplyr::left_join(pre_data, by = dplyr::join_by(respondent_id, collector_id, start_date, end_date, ip_address, email_address, 
                                                                        first_name, last_name, unique_id, gender, ethnicity, first_gen))
sm_data
## # A tibble: 100 × 57
##   respondent_id collector_id start_date end_date   ip_address      email_address
##           <dbl>        <dbl> <date>     <date>     <chr>           <chr>        
## 1  114628000001    431822954 2024-06-05 2024-06-06 227.224.138.113 coraima59@me…
## 2  114628000002    431822954 2024-06-21 2024-06-22 110.241.132.50  mstamm@hermi…
## 3  114628000003    431822954 2024-06-14 2024-06-15 165.58.112.64   precious.fei…
## 4  114628000004    431822954 2024-06-15 2024-06-16 49.34.121.147   ines52@gmail…
## 5  114628000005    431822954 2024-06-15 2024-06-16 115.233.66.80   franz44@hotm…
## # ℹ 95 more rows
## # ℹ 51 more variables: first_name <chr>, last_name <chr>, unique_id <dbl>,
## #   post_knowledge <chr>, post_research_1 <chr>, post_research_2 <chr>,
## #   post_research_3 <chr>, post_research_4 <chr>, post_research_5 <chr>,
## #   post_research_6 <chr>, post_research_7 <chr>, post_research_8 <chr>,
## #   post_ability_1 <chr>, post_ability_2 <chr>, post_ability_3 <chr>,
## #   post_ability_4 <chr>, post_ability_5 <chr>, post_ability_6 <chr>, …

Data Cleaning

  • Convert all likert scale items and all demographics to factors (for ordering).

  • Create numeric variables from the factor variables for use with statistical tests later on.

  • If applicable, drop all “Missing”/NA observations.
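The factor-to-numeric step works because calling as.numeric() on a factor returns the underlying level codes (1 for the first level, 2 for the second, and so on), which is why ordering the levels correctly matters. A quick base-R illustration:

```r
levels_agree5 <- c("Strongly disagree", "Disagree", "Neither agree nor disagree", "Agree", "Strongly agree")

responses <- c("Agree", "Strongly disagree", "Strongly agree")
responses_fct <- factor(responses, levels = levels_agree5)
as.numeric(responses_fct)
# Level positions: "Agree" = 4, "Strongly disagree" = 1, "Strongly agree" = 5
```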

## Knowledge scale
levels_knowledge <- c("Not knowledgeable at all", "A little knowledgeable", "Somewhat knowledgeable", "Very knowledgeable", "Extremely knowledgeable")
## Research Items scale:
levels_confidence <- c("Not at all confident", "Slightly confident", "Somewhat confident", "Very confident", "Extremely confident")
## Ability Items scale: 
levels_min_ext <- c("Minimal", "Slight", "Moderate", "Good", "Extensive")
## Ethics Items scale:
levels_agree5 <- c("Strongly disagree", "Disagree", "Neither agree nor disagree", "Agree", "Strongly agree")

# Demographic levels:
gender_levels <- c("Female","Male","Non-binary", "Do not wish to specify")
ethnicity_levels <- c("White (Non-Hispanic/Latino)", "Asian", "Black",  "Hispanic or Latino", "American Indian or Alaskan Native",
                      "Native Hawaiian or other Pacific Islander", "Do not wish to specify")
first_gen_levels <- c("Yes", "No", "I'm not sure")

# Use mutate() to convert each item in each scale to a factor using the level vectors above; across() applies a function to items selected with contains(),
# or variables can be named individually in a character vector: e.g. c("pre_knowledge", "post_knowledge")
# Also create new numeric variables for all the likert scale items and use the suffix '_num' to denote numeric:
sm_data <- sm_data %>% dplyr::mutate(dplyr::across(tidyselect::contains("_knowledge"), ~ factor(., levels = levels_knowledge)), # match each name pattern to select to each factor level
                                     dplyr::across(tidyselect::contains("_knowledge"), as.numeric, .names = "{.col}_num"), # create new numeric items for all knowledge items
                                     dplyr::across(tidyselect::contains("research_"), ~ factor(., levels = levels_confidence)), 
                                     dplyr::across(tidyselect::contains("research_"), as.numeric, .names = "{.col}_num"), # create new numeric items for all research items
                                     dplyr::across(tidyselect::contains("ability_"), ~ factor(., levels = levels_min_ext)),
                                     dplyr::across(tidyselect::contains("ability_"), as.numeric, .names = "{.col}_num"), # create new numeric items for all ability items
                                     # select ethics items but not the open_ended responses:
                                     dplyr::across(tidyselect::contains("ethics_") & !tidyselect::contains("_oe"), ~ factor(., levels = levels_agree5)),
                                     dplyr::across(tidyselect::contains("ethics_") & !tidyselect::contains("_oe"), as.numeric, .names = "{.col}_num"), # new numeric items for all ethics items
                                     # individually convert all demographics to factor variables:
                                     gender = factor(gender, levels = gender_levels),
                                     ethnicity = factor(ethnicity, levels = ethnicity_levels),
                                     first_gen = factor(first_gen, levels = first_gen_levels),
                                     )
sm_data
## # A tibble: 100 × 97
##   respondent_id collector_id start_date end_date   ip_address      email_address
##           <dbl>        <dbl> <date>     <date>     <chr>           <chr>        
## 1  114628000001    431822954 2024-06-05 2024-06-06 227.224.138.113 coraima59@me…
## 2  114628000002    431822954 2024-06-21 2024-06-22 110.241.132.50  mstamm@hermi…
## 3  114628000003    431822954 2024-06-14 2024-06-15 165.58.112.64   precious.fei…
## 4  114628000004    431822954 2024-06-15 2024-06-16 49.34.121.147   ines52@gmail…
## 5  114628000005    431822954 2024-06-15 2024-06-16 115.233.66.80   franz44@hotm…
## # ℹ 95 more rows
## # ℹ 91 more variables: first_name <chr>, last_name <chr>, unique_id <dbl>,
## #   post_knowledge <fct>, post_research_1 <fct>, post_research_2 <fct>,
## #   post_research_3 <fct>, post_research_4 <fct>, post_research_5 <fct>,
## #   post_research_6 <fct>, post_research_7 <fct>, post_research_8 <fct>,
## #   post_ability_1 <fct>, post_ability_2 <fct>, post_ability_3 <fct>,
## #   post_ability_4 <fct>, post_ability_5 <fct>, post_ability_6 <fct>, …

The data cleaned in this vignette will be used as the example data in the vignettes Data analysis and Statistical Inference and Data Visualization to further showcase all the functions contained in blackstone.

  • end of vignette Importing and Cleaning Data

Data analysis and Statistical Inference

Likert Scale Table

The most common task is creating frequency tables of counts and percentages for likert scale items; blackstone has likertTable() for that:

# Research items pre and post frequency table, with counts and percentages: use levels_confidence character vector
# use likertTable to return frequency table, passing the scale_labels: (can also label the individual questions using the arg question_label)
sm_data %>% dplyr::select(tidyselect::contains("research_") & !tidyselect::contains("_num") & where(is.factor)) %>% 
                blackstone::likertTable(., scale_labels = levels_confidence)

| Question        | Not at all confident | Slightly confident | Somewhat confident | Very confident | Extremely confident | n   |
|-----------------|----------------------|--------------------|--------------------|----------------|---------------------|-----|
| post_research_1 | 2 (2%)               | 26 (26%)           | 23 (23%)           | 25 (25%)       | 24 (24%)            | 100 |
| post_research_2 | 12 (12%)             | 14 (14%)           | 12 (12%)           | 14 (14%)       | 48 (48%)            | 100 |
| post_research_3 | 8 (8%)               | 22 (22%)           | 21 (21%)           | 23 (23%)       | 26 (26%)            | 100 |
| post_research_4 | 2 (2%)               | 24 (24%)           | 25 (25%)           | 19 (19%)       | 30 (30%)            | 100 |
| post_research_5 | 14 (14%)             | 19 (19%)           | 19 (19%)           | 19 (19%)       | 29 (29%)            | 100 |
| post_research_6 | 5 (5%)               | 3 (3%)             | 24 (24%)           | 28 (28%)       | 40 (40%)            | 100 |
| post_research_7 | 11 (11%)             | 10 (10%)           | 14 (14%)           | 17 (17%)       | 48 (48%)            | 100 |
| post_research_8 | 4 (4%)               | 7 (7%)             | 23 (23%)           | 23 (23%)       | 43 (43%)            | 100 |
| pre_research_1  | 10 (10%)             | 8 (8%)             | 41 (41%)           | 27 (27%)       | 14 (14%)            | 100 |
| pre_research_2  | 56 (56%)             | 19 (19%)           | 8 (8%)             | 9 (9%)         | 8 (8%)              | 100 |
| pre_research_3  | 32 (32%)             | 23 (23%)           | 15 (15%)           | 20 (20%)       | 10 (10%)            | 100 |
| pre_research_4  | 9 (9%)               | 24 (24%)           | 32 (32%)           | 24 (24%)       | 11 (11%)            | 100 |
| pre_research_5  | 40 (40%)             | 18 (18%)           | 21 (21%)           | 13 (13%)       | 8 (8%)              | 100 |
| pre_research_6  | 17 (17%)             | 25 (25%)           | 25 (25%)           | 16 (16%)       | 17 (17%)            | 100 |
| pre_research_7  | 59 (59%)             | 13 (13%)           | 11 (11%)           | 10 (10%)       | 7 (7%)              | 100 |
| pre_research_8  | 21 (21%)             | 19 (19%)           | 23 (23%)           | 19 (19%)       | 18 (18%)            | 100 |

Using Functional Programming to Speed Up Analysis

Here is one approach that uses functional programming from the {purrr} package to create many frequency tables at once:

# Another way to make a list of many freq_tables to print out with other data analysis later on, 
# using pmap() to do multiple likertTable() at once:
# Set up tibbles of each set of scales that contain all pre and post data:
# research:
research_df <- sm_data %>% dplyr::select(tidyselect::contains("research_") & !tidyselect::contains("_num") & where(is.factor))
# knowledge:
knowledge_df <- sm_data %>% dplyr::select(tidyselect::contains("_knowledge") & !tidyselect::contains("_num") & where(is.factor))
# ability:
ability_df <- sm_data %>% dplyr::select(tidyselect::contains("ability_") & !tidyselect::contains("_num") & where(is.factor))
# ethics:
ethics_df <- sm_data %>% dplyr::select(tidyselect::contains("ethics_") & !tidyselect::contains("_oe") & !tidyselect::contains("_num") & where(is.factor)) 

# set up tibble with the columns as the args to pass to likertTable(), each row of the column `df` is the tibble of items and 
# each row of `scale_labels` is the vector of likert scale labels:
freq_params <- tibble::tribble(
  ~df,           ~scale_labels, # name of columns (these need to match the names of the arguments in the function that you want to use later in `purrr::pmap()`)
   knowledge_df,  levels_knowledge, 
   research_df,   levels_confidence,  
   ability_df,    levels_min_ext,
   ethics_df,     levels_agree5
)
# Create a named list of frequency tables using `purrr::pmap()`, which takes a tibble where each column is an argument passed to the function and
# each row contains the inputs for a single output; here each row will be one frequency table that is returned to a list and named for easy retrieval later on:
freq_tables <- freq_params %>% purrr::pmap(blackstone::likertTable) %>% 
    purrr::set_names(., c("Knowledge Items", "Research Items", "Ability Items", "Ethics Items"))

# Can select the list by position or by name:
# freq_tables[[1]] # by list position
freq_tables[["Knowledge Items"]] # by name

| Question       | Not knowledgeable at all | A little knowledgeable | Somewhat knowledgeable | Very knowledgeable | Extremely knowledgeable | n   |
|----------------|--------------------------|------------------------|------------------------|--------------------|-------------------------|-----|
| post_knowledge | 7 (7%)                   | 14 (14%)               | 9 (9%)                 | 24 (24%)           | 46 (46%)                | 100 |
| pre_knowledge  | 65 (65%)                 | 11 (11%)               | 9 (9%)                 | 9 (9%)             | 6 (6%)                  | 100 |

freq_tables[["Research Items"]]

| Question        | Not at all confident | Slightly confident | Somewhat confident | Very confident | Extremely confident | n   |
|-----------------|----------------------|--------------------|--------------------|----------------|---------------------|-----|
| post_research_1 | 2 (2%)               | 26 (26%)           | 23 (23%)           | 25 (25%)       | 24 (24%)            | 100 |
| post_research_2 | 12 (12%)             | 14 (14%)           | 12 (12%)           | 14 (14%)       | 48 (48%)            | 100 |
| post_research_3 | 8 (8%)               | 22 (22%)           | 21 (21%)           | 23 (23%)       | 26 (26%)            | 100 |
| post_research_4 | 2 (2%)               | 24 (24%)           | 25 (25%)           | 19 (19%)       | 30 (30%)            | 100 |
| post_research_5 | 14 (14%)             | 19 (19%)           | 19 (19%)           | 19 (19%)       | 29 (29%)            | 100 |
| post_research_6 | 5 (5%)               | 3 (3%)             | 24 (24%)           | 28 (28%)       | 40 (40%)            | 100 |
| post_research_7 | 11 (11%)             | 10 (10%)           | 14 (14%)           | 17 (17%)       | 48 (48%)            | 100 |
| post_research_8 | 4 (4%)               | 7 (7%)             | 23 (23%)           | 23 (23%)       | 43 (43%)            | 100 |
| pre_research_1  | 10 (10%)             | 8 (8%)             | 41 (41%)           | 27 (27%)       | 14 (14%)            | 100 |
| pre_research_2  | 56 (56%)             | 19 (19%)           | 8 (8%)             | 9 (9%)         | 8 (8%)              | 100 |
| pre_research_3  | 32 (32%)             | 23 (23%)           | 15 (15%)           | 20 (20%)       | 10 (10%)            | 100 |
| pre_research_4  | 9 (9%)               | 24 (24%)           | 32 (32%)           | 24 (24%)       | 11 (11%)            | 100 |
| pre_research_5  | 40 (40%)             | 18 (18%)           | 21 (21%)           | 13 (13%)       | 8 (8%)              | 100 |
| pre_research_6  | 17 (17%)             | 25 (25%)           | 25 (25%)           | 16 (16%)       | 17 (17%)            | 100 |
| pre_research_7  | 59 (59%)             | 13 (13%)           | 11 (11%)           | 10 (10%)       | 7 (7%)              | 100 |
| pre_research_8  | 21 (21%)             | 19 (19%)           | 23 (23%)           | 19 (19%)       | 18 (18%)            | 100 |

freq_tables[["Ability Items"]]

| Question       | Minimal  | Slight   | Moderate | Good     | Extensive | n   |
|----------------|----------|----------|----------|----------|-----------|-----|
| post_ability_1 | 3 (3%)   | 20 (20%) | 23 (23%) | 20 (20%) | 34 (34%)  | 100 |
| post_ability_2 | 2 (2%)   | 11 (11%) | 11 (11%) | 29 (29%) | 47 (47%)  | 100 |
| post_ability_3 | 11 (11%) | 19 (19%) | 22 (22%) | 15 (15%) | 33 (33%)  | 100 |
| post_ability_4 | 9 (9%)   | 6 (6%)   | 9 (9%)   | 23 (23%) | 53 (53%)  | 100 |
| post_ability_5 | 12 (12%) | 19 (19%) | 16 (16%) | 27 (27%) | 26 (26%)  | 100 |
| post_ability_6 | 6 (6%)   | 11 (11%) | 29 (29%) | 29 (29%) | 25 (25%)  | 100 |
| pre_ability_1  | 9 (9%)   | 28 (28%) | 37 (37%) | 15 (15%) | 11 (11%)  | 100 |
| pre_ability_2  | 24 (24%) | 14 (14%) | 27 (27%) | 22 (22%) | 13 (13%)  | 100 |
| pre_ability_3  | 26 (26%) | 30 (30%) | 19 (19%) | 19 (19%) | 6 (6%)    | 100 |
| pre_ability_4  | 43 (43%) | 27 (27%) | 13 (13%) | 4 (4%)   | 13 (13%)  | 100 |
| pre_ability_5  | 32 (32%) | 18 (18%) | 26 (26%) | 17 (17%) | 7 (7%)    | 100 |
| pre_ability_6  | 26 (26%) | 12 (12%) | 18 (18%) | 26 (26%) | 18 (18%)  | 100 |

freq_tables[["Ethics Items"]]

| Question      | Strongly disagree | Disagree | Neither agree nor disagree | Agree    | Strongly agree | n   |
|---------------|-------------------|----------|----------------------------|----------|----------------|-----|
| post_ethics_1 | 4 (4%)            | 18 (18%) | 24 (24%)                   | 30 (30%) | 24 (24%)       | 100 |
| post_ethics_2 | 7 (7%)            | 10 (10%) | 13 (13%)                   | 31 (31%) | 39 (39%)       | 100 |
| post_ethics_3 | 7 (7%)            | 26 (26%) | 14 (14%)                   | 28 (28%) | 25 (25%)       | 100 |
| post_ethics_4 | 10 (10%)          | 7 (7%)   | 6 (6%)                     | 19 (19%) | 58 (58%)       | 100 |
| post_ethics_5 | 3 (3%)            | 14 (14%) | 25 (25%)                   | 31 (31%) | 27 (27%)       | 100 |
| pre_ethics_1  | 10 (10%)          | 16 (16%) | 45 (45%)                   | 20 (20%) | 9 (9%)         | 100 |
| pre_ethics_2  | 20 (20%)          | 15 (15%) | 27 (27%)                   | 23 (23%) | 15 (15%)       | 100 |
| pre_ethics_3  | 28 (28%)          | 16 (16%) | 30 (30%)                   | 12 (12%) | 14 (14%)       | 100 |
| pre_ethics_4  | 50 (50%)          | 17 (17%) | 12 (12%)                   | 6 (6%)   | 15 (15%)       | 100 |
| pre_ethics_5  | 14 (14%)          | 16 (16%) | 31 (31%)                   | 30 (30%) | 9 (9%)         | 100 |

Grouped Demographic Table

blackstone contains groupedTable(), a function to create a combined frequency table for demographics that can also be grouped by a variable, like role or cohort.

# Set up labels for variables
# Labels for questions column of table, pass to question_labels argument:
demos_labels <- c('Gender' = "gender",
                  'Race/Ethnicity' = "ethnicity",
                  'First-Generation College Student' = "first_gen")

sm_data %>% dplyr::select(gender, ethnicity, first_gen) %>% # select the demographic vars
                 blackstone::groupedTable(question_labels = demos_labels) # pass the new labels for the 'Question' column.

| Question                         | Response                                  | n = 100¹ |
|----------------------------------|-------------------------------------------|----------|
| Gender                           | Female                                    | 47 (47%) |
|                                  | Male                                      | 50 (50%) |
|                                  | Non-binary                                | 2 (2%)   |
|                                  | Do not wish to specify                    | 1 (1%)   |
| Race/Ethnicity                   | White (Non-Hispanic/Latino)               | 36 (36%) |
|                                  | Asian                                     | 23 (23%) |
|                                  | Black                                     | 7 (7%)   |
|                                  | Hispanic or Latino                        | 18 (18%) |
|                                  | American Indian or Alaskan Native         | 5 (5%)   |
|                                  | Native Hawaiian or other Pacific Islander | 7 (7%)   |
|                                  | Do not wish to specify                    | 4 (4%)   |
| First-Generation College Student | Yes                                       | 59 (59%) |
|                                  | No                                        | 39 (39%) |
|                                  | I'm not sure                              | 2 (2%)   |

¹ n (%)

Statistical Inference: T-test or Wilcoxon test

In this section, we will run simple statistical tests using the pipe-friendly functions contained in the {rstatix} package.

Single Pre-Post Items

Running Normality Tests, then T-test or Wilcoxon test:

Since a large number of our surveys have sample sizes smaller than 30, it is important to check the normality assumption before running any statistical tests. If the data is normally distributed, we can use t-tests (parametric tests) for simple hypothesis testing of pre-post items, to see whether changes between the surveys are statistically significant.

If the data is not normally distributed (non-normal), we will have to use non-parametric statistical tests, like the Wilcoxon test.

Determining whether data is normally distributed involves both visual checks and statistical tests.

Normality Visualizations: Density and QQ (Quantile-Quantile) Plots

The ggpubr package contains functions to create density and QQ plots to visually inspect if data is distributed normally.

# First create a difference score for the pre and post items:
sm_data <- sm_data %>% dplyr::mutate(knowledge_diff = post_knowledge_num - pre_knowledge_num)  # get difference of pre and post scores
# Density plot:
ggpubr::ggdensity(sm_data, "knowledge_diff", fill = "lightgray")

# QQ plot:
ggpubr::ggqqplot(sm_data, "knowledge_diff")

If the data were normally distributed, the density plot would be shaped like a bell curve, and the QQ plot would have all of the sample observations (points) lined up along the 45-degree reference line within the shaded confidence interval.

This data is probably not normally distributed; let's run a statistical test to confirm.

Next, run the Shapiro-Wilk test. The null hypothesis of this test is that the sample distribution is normal, so if the test is significant (p-value < 0.05), the distribution is non-normal.

sm_data %>% rstatix::shapiro_test(knowledge_diff)
## # A tibble: 1 × 3
##   variable       statistic           p
##   <chr>              <dbl>       <dbl>
## 1 knowledge_diff     0.885 0.000000301

rstatix::shapiro_test() returns a tibble with three columns; the column named p is the p-value for the Shapiro-Wilk test.

The data is not normally distributed for the knowledge items (since the p-value is < 0.05), so use a Wilcoxon test.
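The decision rule can also be sketched in base R (a simplified rule of thumb, run here on simulated difference scores rather than the survey data):

```r
set.seed(123)
# Simulated, deliberately skewed stand-in for a pre-post difference score:
diff_scores <- rexp(100) - 1

# Shapiro-Wilk: a p-value < 0.05 suggests non-normality, so fall back
# to the non-parametric Wilcoxon test; otherwise use a t-test:
p_normal <- shapiro.test(diff_scores)$p.value
result <- if (p_normal < 0.05) {
  wilcox.test(diff_scores)  # non-parametric
} else {
  t.test(diff_scores)       # parametric
}
result$method
```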

Wilcoxon test

There are a couple of ways to run a Wilcoxon test: either the pipe-friendly rstatix::wilcox_test(), or base R's wilcox.test(), where you must pass each variable as a numeric vector.

# Either use the pipe-friendly `wilcox_test()` from `rstatix`; need to convert to long form and have `timing` as a variable:
knowledge_wilcoxon <- sm_data %>% dplyr::select(tidyselect::contains("_knowledge") & tidyselect::contains("_num")) %>%  # select the data
                                  tidyr::pivot_longer(tidyselect::contains(c("pre_", "post_")), names_to = "question", values_to = "response") %>% # pivot to long-form
                                  tidyr::separate(.data$question, into = c("timing", "question"), sep = "_", extra = "merge") %>% # Separate out the prefix to get timing
                                  rstatix::wilcox_test(response ~ timing, paired = TRUE, detailed = TRUE) # Run the Wilcoxon test using column "response" (numeric values) on "timing" (pre or post)
knowledge_wilcoxon
## # A tibble: 1 × 12
##   estimate .y.   group1 group2    n1    n2 statistic        p conf.low conf.high
## *    <dbl> <chr> <chr>  <chr>  <int> <int>     <dbl>    <dbl>    <dbl>     <dbl>
## 1     2.50 resp… post   pre      100   100     3800. 1.23e-13     2.00      3.00
## # ℹ 2 more variables: method <chr>, alternative <chr>
# Or use the simple base R wilcox.test with each pre and post item:
wilcox.test(sm_data[["post_knowledge_num"]], sm_data[["pre_knowledge_num"]], paired = TRUE)
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  sm_data[["post_knowledge_num"]] and sm_data[["pre_knowledge_num"]]
## V = 3800, p-value = 0.0000000000001
## alternative hypothesis: true location shift is not equal to 0

The Wilcoxon test is significant: there is a significant difference between pre and post knowledge scores.

Composite Scales (multiple items)

Most of the surveys we conduct use composite scales of items that measure an underlying concept. These are averaged together to create a more reliable measure that can then be used in statistical inference.

Creating Composite Scores

Create composite scores for pre and post data by taking the mean of each set of items, and then compute difference scores between the pre and post means:

sm_data <- sm_data %>% dplyr::rowwise() %>% # Get the mean for each individual by row
    dplyr::mutate(pre_research_mean = mean(dplyr::c_across(tidyselect::contains("pre_research_") & tidyselect::contains("_num"))), # pre mean for each individual
                  post_research_mean = mean(dplyr::c_across(tidyselect::contains("post_research_") & tidyselect::contains("_num"))), # post mean for each individual
                  diff_research = post_research_mean - pre_research_mean # get difference scores of pre and post means.
    ) %>% dplyr::ungroup()
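The rowwise() approach above is easy to read but can be slow on large data. An equivalent, vectorized sketch uses rowMeans() over the same column selections, with no rowwise()/ungroup() needed:

```r
# Vectorized alternative: rowMeans() over the selected columns.
# Produces the same composite and difference scores as the rowwise version above.
sm_data <- sm_data %>%
    dplyr::mutate(
        pre_research_mean  = rowMeans(dplyr::across(tidyselect::contains("pre_research_") & tidyselect::contains("_num"))),
        post_research_mean = rowMeans(dplyr::across(tidyselect::contains("post_research_") & tidyselect::contains("_num"))),
        diff_research      = post_research_mean - pre_research_mean
    )
```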

Normality Testing of Composite Scales

Run a visual inspection of the difference scores between pre and post mean of the research items:

# Density plot:
ggpubr::ggdensity(sm_data, "diff_research", fill = "lightgray")

# QQ plot:
ggpubr::ggqqplot(sm_data, "diff_research")

Visually, the data appears normally distributed. Next, run the Shapiro-Wilk test to confirm.

sm_data %>% rstatix::shapiro_test(diff_research) # not significant, data likely normal
## # A tibble: 1 × 3
##   variable      statistic     p
##   <chr>             <dbl> <dbl>
## 1 diff_research     0.991 0.720

The data is normally distributed for the research composite items (since the p-value is > 0.05), so use a t-test.

T-test of Composite Scales

# Either use the pipe-friendly `t_test()` from `rstatix`; the data needs to be converted to long form with `timing` as a variable:
research_t_test <- sm_data %>% dplyr::select(pre_research_mean, post_research_mean) %>% # select the pre and post means for research items
                               tidyr::pivot_longer(tidyselect::contains(c("pre_", "post_")), names_to = "question", values_to = "response") %>% # pivot to long-form
                               tidyr::separate(.data[["question"]], into = c("timing", "question"), sep = "_", extra = "merge") %>% # Separate out the prefix to get timing
                               rstatix::t_test(response ~ timing, paired = TRUE, detailed = TRUE) # Run the t-test using column "response" (numeric values) on "timing" (pre or post)
research_t_test
## # A tibble: 1 × 13
##   estimate .y.      group1 group2    n1    n2 statistic        p    df conf.low
## *    <dbl> <chr>    <chr>  <chr>  <int> <int>     <dbl>    <dbl> <dbl>    <dbl>
## 1     1.02 response post   pre      100   100      15.6 2.43e-28    99    0.890
## # ℹ 3 more variables: conf.high <dbl>, method <chr>, alternative <chr>
# Or use the simple base R t.test() with the pre and post means:
t.test(sm_data[["post_research_mean"]], sm_data[["pre_research_mean"]],  paired = TRUE)
## 
##  Paired t-test
## 
## data:  sm_data[["post_research_mean"]] and sm_data[["pre_research_mean"]]
## t = 16, df = 99, p-value <0.0000000000000002
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  0.8899 1.1501
## sample estimates:
## mean difference 
##            1.02

The t-test is significant: there is a mean difference of 1.02 between pre and post scores.

The vignette Data Visualization will explain how to create visuals using blackstone with the example data analyzed in this vignette.

  • End vignette: Data Analysis and Statistical Inference (file: ‘analysis.Rmd’)

Data Visualization

blackstone contains many functions for data visualization. This vignette shows you how to use its color palettes and chart functions.

Color Palettes

blackstone has functions that create 3 types of charts for data visualization: stacked bar charts, diverging stacked bar charts, and arrow charts.

The functions for stacked bar charts and diverging stacked bar charts can use two different color palettes: a blue sequential palette or a blue-red diverging color palette.

The blue sequential palette should be used for all Likert scales that have one clear direction, like: Not at all confident, Slightly confident, Somewhat confident, Very confident, Extremely confident

The blue-red diverging color palette should be used if the items have a Likert scale that is folded or runs from negative to positive valence, like: Strongly disagree, Disagree, Neither agree nor disagree, Agree, Strongly agree

The next three sections show examples on how to use these functions.

Stacked Bar Charts

The most common visual used in reporting at Blackstone Research and Evaluation is a stacked bar chart; blackstone has a function that makes creating these charts fast and easy: stackedBarChart().

stackedBarChart() takes in a tibble of factor/character variables to turn into a stacked bar chart. The other requirement is a character vector of scale labels for the Likert scale that makes up the items in the tibble (the same one used to set them up as factors in the data cleaning section).

Pre-post Stacked Bar Chart with Overall n and Percentages

  • By default, stackedBarChart() uses the blue sequential palette to color the bars and sorts the items from the highest post counts/percentages on down.
# Research Items scale:
levels_confidence <- c("Not at all confident", "Slightly confident", "Somewhat confident", "Very confident", "Extremely confident")

# select variables and pass them to `stackedBarChart()` along with scale_labels.
sm_data %>% dplyr::select(tidyselect::contains("research_") & !tidyselect::contains("_num") & where(is.factor)) %>% # select the factor variables for the research items
    blackstone::stackedBarChart(., scale_labels = levels_confidence, pre_post = TRUE)

Pre-post Stacked Bar Chart with Individual Item n and Counts

# Select variables and pass them to `stackedBarChart()` along with scale_labels; set the arguments `percent_label` and `overall_n` both to FALSE:
sm_data %>% dplyr::select(tidyselect::contains("research_") & !tidyselect::contains("_num") & where(is.factor)) %>% # select the factor variables for the research items
    blackstone::stackedBarChart(., scale_labels = levels_confidence, pre_post = TRUE, percent_label = FALSE, overall_n = FALSE)

Pre-post Stacked Bar Chart with Blue-Red Diverging Color Palette

## Ethics Items scale:
levels_agree5 <- c("Strongly disagree", "Disagree", "Neither agree nor disagree", "Agree", "Strongly agree")

# select variables and pass them to `stackedBarChart()` along with scale_labels, 
# change `fill_colors` to "div" to use the blue-red diverging color palette:
sm_data %>% dplyr::select(tidyselect::contains("ethics_") & !tidyselect::contains("_num") & # select the factor variables for the ethics items
                              !tidyselect::contains("_oe") & where(is.factor)) %>% 
            blackstone::stackedBarChart(., scale_labels = levels_agree5, pre_post = TRUE, fill_colors = "div")

Pre-post Stacked Bar Chart with New Question Labels and Order:

# Question labels as a named vector with the naming structure
# like this: c("new label" = "original variable name"), where the 
# names are the new question labels and the old names are the values without pre or post prefixes:
# Here I will use paste0 to create 8 research items like they appear without prefixes:
research_question_labels <- paste0("research_", 1:8)
# Set new labels as names of `research_question_labels`
names(research_question_labels) <- c("Research relevant background literature", "Identify a scientific problem", 
                                     "Develop testable and realistic research questions", "Develop a falsifiable hypothesis", 
                                     "Conduct quantitative data analysis", "Design an experiment/Create a research design", 
                                     "Interpret findings and making recommendations", "Scientific or technical writing")

# select variables and pass them to `stackedBarChart()` along with scale_labels, also pass research_question_labels to `question_labels` and set `question_order` to TRUE.
sm_data %>% dplyr::select(tidyselect::contains("research_") & !tidyselect::contains("_num") & where(is.factor)) %>% # select the factor variables for the research items
    blackstone::stackedBarChart(., scale_labels = levels_confidence, pre_post = TRUE, question_labels = research_question_labels, question_order = TRUE)

Single time point Stacked Bar Chart with New Question Labels and Order

# Question labels as a named vector with the naming structure
# like this: c("new label" = "original variable name"), where the 
# names are the new question labels and the old names are the values without pre or post prefixes:
# Here I will use paste0 to create 8 research items like they appear without prefixes:
research_question_labels <- paste0("post_research_", 1:8)
# Set new labels as names of `research_question_labels`
names(research_question_labels) <- c("Research relevant background literature", "Identify a scientific problem", 
                                     "Develop testable and realistic research questions", "Develop a falsifiable hypothesis", 
                                     "Conduct quantitative data analysis", "Design an experiment/Create a research design", 
                                     "Interpret findings and making recommendations", "Scientific or technical writing")

# select variables and pass them to `stackedBarChart()` along with scale_labels, set pre_post to FALSE (default),
#  also pass research_question_labels to `question_labels` and set `question_order` to TRUE.
sm_data %>% dplyr::select(tidyselect::contains("post_research_") & !tidyselect::contains("_num") & where(is.factor)) %>% # select the factor variables for the research items
    blackstone::stackedBarChart(., scale_labels = levels_confidence, question_labels = research_question_labels, question_order = TRUE)

Diverging Stacked Bar Charts

Another common visual used in reporting at Blackstone Research and Evaluation is a diverging stacked bar chart, which I will refer to from now on as a diverging bar chart. blackstone has a function to make this type of chart: divBarChart().

The diverging bar charts created by divBarChart() diverge just after the midpoint of the Likert scale of the items supplied to the function. See the examples below.

divBarChart() has all of the same arguments as stackedBarChart(), so using it has the same requirements.

Pre-post Diverging Bar Chart with Overall n and Percentages

  • By default, divBarChart() uses the blue sequential palette to color the bars and sorts the items from the highest post counts/percentages on down.
# Research Items scale:
levels_confidence <- c("Not at all confident", "Slightly confident", "Somewhat confident", "Very confident", "Extremely confident")

# select variables and pass them to `divBarChart()` along with scale_labels.
sm_data %>% dplyr::select(tidyselect::contains("research_") & !tidyselect::contains("_num") & where(is.factor)) %>% # select the factor variables for the research items
    blackstone::divBarChart(., scale_labels = levels_confidence, pre_post = TRUE)

Pre-post Diverging Bar Chart with Individual Item n and Counts

# Select variables and pass them to `divBarChart()` along with scale_labels; set the arguments `percent_label` and `overall_n` both to FALSE:
sm_data %>% dplyr::select(tidyselect::contains("research_") & !tidyselect::contains("_num") & where(is.factor)) %>% # select the factor variables for the research items
    blackstone::divBarChart(., scale_labels = levels_confidence, pre_post = TRUE, percent_label = FALSE, overall_n = FALSE)

Pre-post Diverging Bar Chart with Blue-Red Diverging Color Palette

## Ethics Items scale:
levels_agree5 <- c("Strongly disagree", "Disagree", "Neither agree nor disagree", "Agree", "Strongly agree")

# select variables and pass them to `divBarChart()` along with scale_labels, 
# change `fill_colors` to "div" to use the blue-red diverging color palette:
sm_data %>% dplyr::select(tidyselect::contains("ethics_") & !tidyselect::contains("_num") & # select the factor variables for the ethics items
                              !tidyselect::contains("_oe") & where(is.factor)) %>% 
            blackstone::divBarChart(., scale_labels = levels_agree5, pre_post = TRUE, fill_colors = "div")

Pre-post Diverging Bar Chart with New Question Labels and Order

# Question labels as a named vector with the naming structure
# like this: c("new label" = "original variable name"), where the 
# names are the new question labels and the old names are the values without pre or post prefixes:
# Here I will use paste0 to create 8 research items like they appear without prefixes:
research_question_labels <- paste0("research_", 1:8)
# Set new labels as names of `research_question_labels`
names(research_question_labels) <- c("Research relevant background literature", "Identify a scientific problem", 
                                     "Develop testable and realistic research questions", "Develop a falsifiable hypothesis", 
                                     "Conduct quantitative data analysis", "Design an experiment/Create a research design", 
                                     "Interpret findings and making recommendations", "Scientific or technical writing")

# select variables and pass them to `divBarChart()` along with scale_labels, also pass research_question_labels to `question_labels` and set `question_order` to TRUE.
sm_data %>% dplyr::select(tidyselect::contains("research_") & !tidyselect::contains("_num") & where(is.factor)) %>% # select the factor variables for the research items
    blackstone::divBarChart(., scale_labels = levels_confidence, pre_post = TRUE, question_labels = research_question_labels, question_order = TRUE)

Single time point Diverging Bar Chart with New Question Labels and Order

# Question labels as a named vector with the naming structure
# like this: c("new label" = "original variable name"), where the 
# names are the new question labels and the old names are the values without pre or post prefixes:
# Here I will use paste0 to create 8 research items like they appear without prefixes:
research_question_labels <- paste0("post_research_", 1:8)
# Set new labels as names of `research_question_labels`
names(research_question_labels) <- c("Research relevant background literature", "Identify a scientific problem", 
                                     "Develop testable and realistic research questions", "Develop a falsifiable hypothesis", 
                                     "Conduct quantitative data analysis", "Design an experiment/Create a research design", 
                                     "Interpret findings and making recommendations", "Scientific or technical writing")

# select variables and pass them to `divBarChart()` along with scale_labels, set pre_post to FALSE (default),
#  also pass research_question_labels to `question_labels` and set `question_order` to TRUE.
sm_data %>% dplyr::select(tidyselect::contains("post_research_") & !tidyselect::contains("_num") & where(is.factor)) %>% # select the factor variables for the research items
    blackstone::divBarChart(., scale_labels = levels_confidence, question_labels = research_question_labels, question_order = TRUE)

Arrow Charts

Arrow charts show the difference in means at two time points. blackstone has two functions that create arrow charts: arrowChart() and arrowChartGroup().

Both use a tibble of numeric pre-post data as the main input, and also require a character vector of scale labels for the numeric scale that makes up the items in the tibble. The rest of the arguments for the two arrow chart functions are the same as those of the stacked bar chart functions.

arrowChart()

Arrow Chart with defaults

  • By default, arrowChart() sorts the items/arrows from the highest post average on down, and the arrows are dark blue (hex code #283251).
# Research Items scale:
levels_confidence <- c("Not at all confident", "Slightly confident", "Somewhat confident", "Very confident", "Extremely confident")

# select variables and pass them to `arrowChart()` along with scale_labels.
sm_data %>% dplyr::select(tidyselect::contains("research_") & tidyselect::contains("_num") & where(is.numeric)) %>% # select the numeric variables for the research items
    blackstone::arrowChart(., scale_labels = levels_confidence)

Arrow Chart with Individual Item n

# Select variables and pass them to `arrowChart()` along with scale_labels; set the argument `overall_n` to FALSE:
sm_data %>% dplyr::select(tidyselect::contains("research_") & tidyselect::contains("_num") & where(is.numeric)) %>% # select the numeric variables for the research items
    blackstone::arrowChart(., scale_labels = levels_confidence, overall_n = FALSE)

Arrow Chart with New Question Labels and Order

# Question labels as a named vector with the naming structure
# like this: c("new label" = "original variable name"), where the 
# names are the new question labels and the old names are the values without pre or post prefixes:
# Here I will use paste0 to create 8 research items like they appear without prefixes:
research_question_labels <- paste0("research_", 1:8)
# Set new labels as names of `research_question_labels`
names(research_question_labels) <- c("Research relevant background literature", "Identify a scientific problem", 
                                     "Develop testable and realistic research questions", "Develop a falsifiable hypothesis", 
                                     "Conduct quantitative data analysis", "Design an experiment/Create a research design", 
                                     "Interpret findings and making recommendations", "Scientific or technical writing")
# Select variables and pass them to `arrowChart()` along with scale_labels, and also pass research_question_labels to `question_labels` and set `question_order` to TRUE:
sm_data %>% dplyr::select(tidyselect::contains("research_") & tidyselect::contains("_num") & where(is.numeric)) %>% # select the numeric variables for the research items
    blackstone::arrowChart(., scale_labels = levels_confidence, question_labels = research_question_labels, question_order = TRUE)

arrowChartGroup()

arrowChartGroup() allows the user to create an arrow chart of pre-post averages grouped by a third variable, while also showing the overall pre-post average as an arrow.

Arrow Chart by Group with defaults

  • By default, arrowChartGroup() sorts the items/arrows from the highest post average on down, and the arrows are colored using the Qualitative Color Palette, which has 11 distinct colors:
    • #E69F00, #56B4E9, #009E73, #CC79A7, #D55E00, #0072B2, #440154FF, #999999, #117733, #283251, #999933
  • arrowChartGroup() returns pre-post averages for each group passed to group_levels, as well as an “Overall” arrow for the whole sample, which is always colored black. The order of group_levels also determines the order of the arrows and the legend.
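The grouping examples below rely on a gender_levels character vector defined during data cleaning. If it is not already in your environment, a minimal sketch recovers it from the factor levels of the grouping variable (assuming gender is a factor in sm_data):

```r
# Hypothetical setup: recover the group levels from the factor itself.
# The order of this vector sets the arrow and legend order in the chart,
# so reorder it here if a different ordering is wanted.
gender_levels <- levels(sm_data$gender)
```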
# Research Items scale:
levels_confidence <- c("Not at all confident", "Slightly confident", "Somewhat confident", "Very confident", "Extremely confident")

# select variables and pass them to `arrowChartGroup()` along with scale_labels, the grouping variable in `group` and the levels for each group in `group_levels`:
sm_data %>% dplyr::select(gender, tidyselect::contains("research_") & tidyselect::contains("_num") & where(is.numeric)) %>% # select the numeric variables for the research items
    blackstone::arrowChartGroup(., group = "gender", group_levels = gender_levels, scale_labels = levels_confidence)

Arrow Chart with Individual Item n

# Select variables and pass them to `arrowChartGroup()` along with scale_labels; set the argument `overall_n` to FALSE:
sm_data %>% dplyr::select(gender, tidyselect::contains("research_") & tidyselect::contains("_num") & where(is.numeric)) %>% # select the numeric variables for the research items
    blackstone::arrowChartGroup(., group = "gender", group_levels = gender_levels, scale_labels = levels_confidence, overall_n = FALSE)

Arrow Chart with New Question Labels and Order

# Question labels as a named vector with the naming structure
# like this: c("new label" = "original variable name"), where the 
# names are the new question labels and the old names are the values without pre or post prefixes:
# Here I will use paste0 to create 8 research items like they appear without prefixes:
research_question_labels <- paste0("research_", 1:8)
# Set new labels as names of `research_question_labels`
names(research_question_labels) <- c("Research relevant background literature", "Identify a scientific problem", 
                                     "Develop testable and realistic research questions", "Develop a falsifiable hypothesis", 
                                     "Conduct quantitative data analysis", "Design an experiment/Create a research design", 
                                     "Interpret findings and making recommendations", "Scientific or technical writing")
# Select variables and pass them to `arrowChartGroup()` along with scale_labels, and also pass research_question_labels to `question_labels` and set `question_order` to TRUE:
sm_data %>% dplyr::select(gender, tidyselect::contains("research_") & tidyselect::contains("_num") & where(is.numeric)) %>% # select the numeric variables for the research items
    blackstone::arrowChartGroup(., group = "gender", group_levels = gender_levels, scale_labels = levels_confidence, question_labels = research_question_labels, question_order = TRUE)

  • End vignette: Data Visualization (file: ‘data_visualization.Rmd’)